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Abstract 

We present Sapporo, a library for performing high-precision gravitational TV-body simulations on NVIDIA 
Graphical Processing Units (GPUs). Our library mimics the GRAPE-6 library, and A''-body codes cur- 
rently running on GRAPE-6 can switch to Sapporo by a simple relinking of the library. The precision 
of our library is comparable to that of GRAPE-6, even though internally the GPU hardware is limited 
to single precision arithmetics. This limitation is effectively overcome by emulating double precision for 
calculating the distance between particles. The performance loss of this operation is small ( ^ 20%) 
compared to the advantage of being able to run at high precision. We tested the library using several 
GRAPE-6-enabled N-body codes, in particular with Starlab and phiGRAPE. We measured peak per- 
formance of 800Gflop/s for running with 10^ particles on a PC with four commercial G92 architecture 
GPUs (two GeForce 9800GX2). As a production test, we simulated a 32k Plummer model with equal 
mass stars well beyond core collapse. The simulation took 41 days, during which the mean performance 
was 113Gflop/s. The GPU did not show any problems from running in a production environment for 
such an extended period of time. 



1. Introduction 

Graphical processing units (GPUs) are quickly becoming main stream in com putational science. The 
introduction of Compute Unified Device Architecture (CUDA, Fernandol , 2004), in which GPUs can b e 



programmed effectively, has generated a paradigm shift in scientific computing (jHoekstra et al.l . 120071) . 
Modern GPUs are greener in terms of CO2 production, have a smaller footprint, are cheaper, and as 
easy to program as traditional parallel computers. In addition, you will not have a waiting queue when 
running large simulations on your local CPU-equipped workstation. 

Newtonian stellar dynamics is traditiona lly on the forefront of high-performance computing. The 
first de dicated Newtonian solver (|Ap plegat e et al. . 119861 was used to study the stability of the solar 
system (jSussman and Wisdoml . 119921 ) . And soon even faster specialized hardware was introduced by the 
inauguration of the GRAPE family of computers, which have an impressive history of breaking computing 
speed records (Makino._and Taiji, 1998). 



N owadays, the CPUs ar e being used in various scienti fic areas, su ch as molecular dynamics ([Anderson et al 

2008t Ivan Meel et al.l . [2008l ). solving Kepler's equations (jPordl . I2OO9I ) and Newtonian iV-body simulations. 
Solving the Newtonian TV-body problem wit h GPUs started in th e early 2000s by adopting a shared time 



step algorithm with a 2nd order integrator ( Nvland et al. . 2004( ). A few years later this algorithm wa s 



improved to include individual time steps and a higher order integrator (Portegies Zwart et al.l . 120071 ). 
in a code that was written in the device specific language Cg (Ferna ndo and Kilgard, 2003). The per- 
form ance was still relati v ely lo w compared to l ater implementa t ions i n CUDA via the Cunbody pack- 
age (Hamada and litaka', '2 0071). Kir in library ( Belleman et al. , l2008h . and the Yebisu A^-body code 
dNitadori and Makino, 20080 Nitado ri. 2009). The main problem with the two former implementations 
was the limited accuracy of the GPU, which only enabled single precision. In part this problem was 
solved with the introduction of the double precision GPU (GTX280), but at a dramaticperformance-hit 
as only a limited number of processor pipelines supported double-precision calculationtjj. 
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In this paper we introduce SapporcH, a library, which is written in CUDA, for running gravitational 
A^-body simulation s on NVIDIA GPUs. The Sapporo library closely matches the calling sequence of the 



GRAPE-6 library (jMakino et alj . l2003f ). and codes which are already running on a GRAPE-6 can be 
immediately be run on GPUs without any modifications to their source code. 

Here, we describe the implementation of our library, and present both accuracy and performance 
measurements u sing Sapporo. For the latter , we use two direct A^-body si mulations environm ent, one 
being Starlab ( Portegies Zwart et all . l200l[ ) and the other is phiGRAPE dHarfst et al. . 2007 ) as it is 



implemented in the Multi Scale Software Environment (MUSE. IPortegies Zwart et al.l . 2009[ ) 



Our new implementation of the A"-body force calculation in CUDA is an important step towards 
high-precision direct A^-body simulations using GPUs, as the library is more flexible and more general 
than previous implementations. In addition we identify the operations in the code where double precision 
accuracy is most important and implement double precision arithmetic in those locations. The cost for 
doing this is limited to a 20% loss in performance (but overall performance is still very high). 

2. Implementation 

We have designed a library to calculate gravitational forces on a GPU in order to accelerate A^- 
body simulations. The library can be used in combination with a standard 4*''-order predictor-corrector 
Hermite integration scheme (Makino & Aarseth 1992), either with shared or block time steps. Such a 
scheme consists of three essential steps: predictor, force calculation, and corrector. In the predictor step, 
the positions and velocities of all particles are predicted for the next time step. Then, the gravitational 
accelerations and their first derivatives (jerks) are calculated using the predicted positions and velocities. 
Finally, the predicted positions and velocities are corrected using the newly computed accelerations and 
jerks. In the case of a block time step scheme, the last two steps are only executed on a block of active 
particles. In the following, we will focus on block time steps, since shared time step scheme can be 
considered a variant of the block time step scheme. 

In the case of the block time steps, the system is divided into i-particles and j-particles, which are 
the sinks and sources of the gravitational forceH, respectively. Within the Hermite scheme, we take the 
following subsequent steps: we predict the positions and velocities of the i- and j-particles (predictor 
step); calculate the gravitational forces exerted by the j-particles on the i-particles (calculation step); 
correct the positions and velocities of the «-particles (corrector step). 

The actual parallelisa tion of the force calcula t ion is not too difficul t and many examples for parallel 
A^-body algorithms exist ( Gualandris et al. . 2007 : Dorband et al.l . [20031 ). Implementing the algorithm on 



a GPU is a little more challenging due to the specific design of GPUs. For example, a G80/G92 NVIDIA 
GPU consist of 16 multi-processors (MPs) each being able to execute up to 768 threads in parallel. A 
parallel, SIMD (single instruction multiple data), element on such a GPU is called a warp. A warp is a 
set of 32 threads which execute the same instruction but operate on different data, and up to 24 warps 
can be executed in parallel on each MP. In total, a single G80/G92 GPU is able to execute up to 12288 
threads in parallel, which means that a program which runs on a GPU, called a kernel, should be able 
to efficiently exploit such a high degree of parallelism. 

In addition, the GPU design impose a strict memory access pattern on such a kernel. For example, 
a GPU has two types of memory: global memory, which is an equivalent of RAM on the CPU, and the 
shared memory, which is equivalent to Ll-cache on the CPU. The global memory is further subdivided 
into local memory which is allocated on a per thread basis, and both texture and constant cached read- 
only memory. The major difference between the shared and the global memory is the access latency. 
Access to an element of the shared memory is as fast as access to a register, whereas access to an element 
in the global memory has a latency of 400-600 cycles. Since the data initially reside in the global memory, 
each of the parallel thread should cooperatively load the data to the shared memory in a specific pattern 
in order to reduce the access latency. The threads can then eflticiently operate on the data in the shared 
memory. 

■^Tho Sapporo library is free and can be downloaded from http://modesta.science.uva.nl. 

•^In the following, by gravitational forces we imply gravitational accelerations, potentials and jerks. 
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Figure 1: Decomposition of the tasks between the CPU and the GPU. The CPU is responsible for the communication, 
prediction of the i-particles and the correction step, whereas the prediction of the j-particles and the force calculation step 
are carried out on the GPU. 



2.1. Task decomposition between the CPU and the GPU 

At first, we estimate the complexity of eacli of tlie steps in the Hermite scheme. We assume that the 
number of j-particles is equal to n, and the number of i-particles is equal to m. The time complexities 
of the predictor, the calculation, and the corrector steps are 0{n), 0{nm), and 0(to), respectively. 
It is important to note, that the n^-scaling of the force calculation has been reduced to nm by the 
introduction of block time steps. As a result, the time contribution of the calculation and of the corrector 
steps decreases with block size, whereas that of the prediction step remains constant. This is important 
as block sizes are typically small on average in direct A'^-body simulations. 

Motivated by this, we implemented both the prediction of j-particles and the force calculation on 
the GPU. The prediction of z-particlcs and the corrector step are carried out on the CPU (Fig. [T]); in 
fact, GRAPE-enabled A^-body codes have exactly the same decomposition. Another motivation for this 
decomposition is the communication overhead: if the predictor step would be carried out on the CPU, all 
n particles would have to be communicated to the GPU at every time-step. In this case, communication 
would become a bottle-neck for the calculation. In our chosen implementation, only i-particles need to 
be communicated, and this therefore decreases the communication overhead. 

2.2. Predictor step 

In this step, the CPU's task is to predict positions and velocities of the j-particles. It is a rather 
trivial step to parallelise since a particle does not require information from other particles. Hence, we 
assign a particle to each of the parallel threads on the GPU. Each thread reads for its particle position 
(x), velocity (v), acceleration (a), jerk (j), and time step (dt) from global memory into registers. This 
data is required to execute the predictor step: 



dt"^ . dt^ 
Y 

dt^ 



Xprod = Xo + Vorft + ao— + jo — , (1) 



Vprcd = vo+aorft+jo — , (2) 

where subscripts "0" and "pred" refer to the initial and predicted values, respectively; both the predicted 
position and velocity are stored in the global memory for later use. The prediction of the i-particles is 
similarly implemented on the CPU. 

Prediction of particle positions must be computed in double prec ision (DP) arit hmetics because the 
integration scheme is most sensitive to round-off errors in positions (|Aarsethl . Il985h . This creates diffi- 



culties for implementing the predictor on the GPU due to the following reason: previous generations of 
NVIDIA CPUs (G80/G92) do not natively support DP arithmetics, and the current generation (GT200) 
executes DP instructions an order of magnitude slower than their single precision (SP) equivalents. How- 
ever, not all operations in Eq. [T] require double precision. Only the position is stored in DP, whereas the 
rest of the terms are stored in SP. Consequently, the multiplications and divisions can be carried out in 
SP, but the summation should be carried out in DP. This can be done most efficiently by emulating DP 
only where it is necessary, instead of implementing all of Eq.[l]in DP. As a result, the loss in performance 
is acceptable and the implementation is not dependent on hardware supporting DP arithmetics. 

The memory storage requirement for DP float (64 bit) is twice that of a SP float (32 bit). Therefore, 
it is natural to store DP numbers in two SP numbers: one of the SP numbers stores the most significant 
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Figure 2: An illustration of the decomposition between multiprocessors (MP) on the GPU. The n j'-particles are equally 
distributed between all P MPs, where P is the number of MPs. An identical copy of all m j-particles is distributed to each 
of the MPs. As a result, every MP computes gravitational forces from n/P bodies on the same set of j-particles in parallel. 



digits, the other stores the least significant digits. Such a representation of DP is known as a double- 
smg le (DSjfl. In this representation, the number of significant digits is 14 compared to 16 in DfH, but 
nevertheless it is a factor of two larger than in SP. 

In the following, we will use DS to emulate DP where it is required. In Eq. [1] this only needs to be 
done for the positions Xq and xp . All multiplications and divisions are carried out and have their results 
stored in SP. These SP numbers are then added together and the resulting number is added to the DS 
numbei[l xq. The result is stored again in DS as xp. Therefore, the floating point operation (FLOP) 
count for Eq. [T]is IDS + lOSP operations or 11 FLOP per particle; the Eq. [2] requires only 6 FLOP per 
particle. 

2.3. The calculation step 

We use predicted positions and velocities of the j-particles to calculate gravitational forces that the 
j-particles exert on the i-particles. We schematically depict the parallelisation strategy for such problem 
in Fig. [21 As before, we assume that the total number of the j-particles is equal to n and the total 
number of the i-particles is equal to m. 

We split the problem in P parallel blocks, where P is the number of parallel multiprocessors (MP) on 
the GPU. The j-particles are distributed evenly among the P MPs. Each of the MPs then computes the 



*http : //crd . Ibl . gov/~dhbailey/nipdlst/ 

^The mantissa of DP consists of 53 bits, compared to 48 bits in DS because a SP number has a mantissa of 24 bits. 
^Addition of DS with SP operation requires 10 SP floating point operations (FLOP) 
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Figure 3: Illustration of the decomposition on one multiprocessor. 1) Each thread reads in parallel one of the n/P particles 
to shared memory; the total number of shared-particles is therefore equals to m. 2) These shared particles are processed 
sequentially. Steps 1) and 2) are repeated until all the n/P particles have been processed. 
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Figure 4: Illustration of the decomposition on a multiprocessor in the case when the number of i-particles is equal to half 
the number of parallel threads that can be executed simultaneously. In this case, every particle is processed by two threads 
in order to guarantee that all threads are active. However, every thread processes half of the j-particles that are associated 
with this MP. 



partial gravitational forces, exerted by the n/P j-particles assigned to that MPs, on all the i-particles 
in parallel. This is accomplished by assigning each of the m parallel threads one of the i-particles in the 
block. Here, we assume that the number of i-particles m, is smaller than or equal to the maximal number 
of parallel threads that a block is able to execute. Should m be larger, we split the i-particles in segments 
such that each of the segments contains the maximal number of i-particles that can be processed in 
parallel; the segments themselves are processed in a serial manner. 

The P parallel blocks have no means of communication with each other, whereas the threads within a 
multiprocessor are able to exchange information via shared memory, which can be considered an equivalent 
of an on-chip cache memory. Therefore, once all the i-particles are processed, partial accelerations from 
each of the blocks must be accumulated. 

The parallelisation method that we use in each block is illustrated in Fig. [31 Each thread in a 
block loads one particle from the global memory to the shared memory, and therefore the total number 
of particles stored in shared memory (shared particles) equals to the number of threads in a block. 
Afterwards, each thread sequentially calculates and sums the partial gravitational forces from the shared 
particles. Once completed, this loop is repeated until all of n/F-particles in the block have been processed. 

If the number of the i-particles is smaller than the maximal number of threads that a block can 
execute, it is desirable to further parallelise the algorithm; we show this schematically in Fig. HI Here, 
we assume that the number of the i-particles equals to half the number of threads that can be executed 
in parallel. Thus, we split the n/P j-particles in two equal parts, such that each thread processes n/{2P) 
particles. In this way, the load per thread decreases by a factor of two, but since the total number of the 
i-particles that are processed concurrently doubles, the performance is unaffected. If we were to refrain 
from this parallelisation step, the performance would suffer since there would not be enough i-particles 
to fully occupy the multiprocessor, and half of the parallel threads would remain idle. 

The final step is to sum up the partial forces computed by each of the P parallel blocks. A simple 
approach, in which all the data from the GPU mem ory is first copied to the host and then reduced 
using the CPU, is sufficiently efficient (|Nitadoril . [2nn9h . Here, we have chosen to reduce the partial forces 
directly in the GPU memory and only copy back the final result. 

The gravitational accelerations and jerks are sensitive to a round-off errors which occur during the 
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calculation of distances between two particles. We decrease these round-off errors by calculating this 
distances using DS arithmetics. This can be understood considering the following example: if the first 
four significant digits in the position of the two particles are the same and the positions are stored in 
SP, the resulting distance has only three significant digits. The use of such low accuracy result degrades 
the quality of the accelerations and jerks, and the accuracy of the simulation overall. If, however, the 
positions are stored in DS, the number of significant digits in the distance is nine instead of three, and 
this is sufficient for an accurate integration. Since the rest of the calculations is carried out in SP, 
the significant digits beyond the seventh can be discarded, implying that the separation, even though 
computed in DS, can be stored in SP. 

The actual implementation is quite simple. One may bluntly apply formulae to calculate the diff'erence 
between two DS numbers. Such method is expensive as it requires approximately ten SP floating point 
operations to subtract two DS. However, such operation takes proper care of both the most and the least 
significant digits in the result. However, for our purpose, we can simplify the subtraction by discarding 
the least significant digits. In this way, we reduce the number of FLOPs. If the positions of two particles 
are stored in DS format, = {xi^hiSX^jo} and Xj = {x^^hi; Xj.io}, the separation Ax^j is 

Axij = (xj,hi - Xj_hi) + (xj,io - Xijo)- (3) 

The number of FLOPs required to carry out this difference is equal to 3. With this implementation, we 
will be able to resolve particles separations to one part in ten millions or better. 

This is the only operation that is carried out in DS arithmetics. Since it is usually assumed that it 
takes 60 FLOPs to calculate gravitational accelerations and jerks, the number of FLOPs that our method 
requires is therefore equals to 66. Here, we substitute each of the three SP subtraction operations with 
EqEl 



2.Jf-. Parallelization over multiple GPUs 

Our library has a built-in support for multiple GPUs installed on a single host PC. The support of 
multiple GPUs is do ne with the help of the GPUWorker library which is a part of the HDDMD molecular 



dynamics GPU-code ( Anderson et al. . 2008f l. The parallelisation is carried out automatically, based on 



the availability of multiple GPUs and the user request given in a configuration file. The application that 
uses the library is unaware of this, meaning that no modifications to source code are required for using 
multiple GPUs on one host. The parallelisation strategy in Sapporo is rather straightforward: given a 
number Q of GPUs, the library distributes n of the j-particles equally between all of the GPUs. As a 
result, each of the GPUs processes n/Q of j-particles, but the same set of z-particles. Using multiple 
GPUs results in a speed up for sufficiently large n, otherwise the smaller number of j-particles per GPU 
may result in a performance loss. 

2. 5. Interface 

We designed the library with the same application interface (API) as the standard GRAPE6 library. 
To replicate the GRAPE-6 functionality, Sapporo supports softening per block of i-particles and searches 
for nearest neighbours for each of the i-particles. This makes Sapporo suitable for high-precision coUisional 
iV-body simulations. The swapping of libraries during the linking process allows existing applications 
which already support the GRAPE6 API to be used directly with Sapporo without any changes to the 
softwar^. Compilation of Sapporo does require the presence of a CUDA-enabled GPU on the host 
computer as well as the GUDA run-time libraries. From our experience, GRAP E6-enabled A/"-bod y 
codes, such as Starlab dPortegies Zwart et al. . 2001 ). phiGRAPE or phiGRAPEch ( Harfst et al. 



200i), 



and NB0DY4 (jAarsethl . 119991 ) can be used with the Sapporo library without any modifications to their 
source code. 



3. Results 



We measure the performance of Sapporo by integrating equal mass Plummer ( Plummet. 19151) spheres 
with a various number of particles, N, for a fraction of an iV-body time unit (jHeggie and Mathieu . 



^The number of pipelines in Sapporo is 256 instead of 48 for the GRAPE6, which may require resizing some arrays if an 
Af-body code is written in a programming language with static memory allocation. 
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Figure 5: The wall-clock time as function of the number of particles using GPUs (solid line with bullets) or GRAPE-6A 
(dotted line with stars). The bullets and stars are measured from integrating a Plummer models with N particles using 
phiGRAPE. From top to bottom, one, two, or four GPUs and one, two, four and eight GRAPEs were used. The dashed-dotted 
line shows the expected A'-^-scaling, offset not to overlap with the measurements. 



1986li^ . These experiments were performed using two d ifferent codes for high precision colhsional N- 



body simulations, Starlab ijPortegies Zwart et al.l . l200lh and phiGRAPE (Harfst et al,, 2007.) . We also 



performed an integration of a A'^ = 32k Plummer sphere well beyond core collapse. 

The host computer used for these tests is equipped with one Intel Core2 Quad CPU operating at 
2.5GHz, 2GB of RAM, and two NVIDIA GeForce 9800GX2 graphics cards. Each of these cards is 
equipped with two independent G92 GPU chips. For the user the system appears to be equipped with 
four independent GPUs. We used the NVIDIA CUDA driver vl77.67, CUDA Toolkit vl.l and CUDA 
SDK vl.l and the installed operating system is Dcbian GNU/Linux 4.0 with the 2.6.21 SMP x86_64 
kernel. The library is also compatible with CUDA Toolkit v2.1. 

For comparison, some of the calculations were repe ated on GRAPE -6A, for which we used eight nodes 



of the Rochester Institute of Technology cluster ( Harfst et al.l . 120071) . Each of these nodes is equipped 



with a GRAPE-6A. The code phiGRAPE runs in parallel over several nodes on this cluster, whereas in 
the GPU setup several GPUs are hosted by a single PC. Some difference in performance can therefore 
be attributed to the inter- node communication on the GRAPE cluster, which is negligible on the single 
CPU-equipped PC. For comparing the accuracy, we have also done a number of calculations on a single 
PC with GRAPE-6A, which we refer to as GRAPE PC. This PC is equipped with a 2. 8GHz Pentium D 
processor ewith 1GB RAM and a GRAPE-6A PCI card, and running Debian 4.0 GNU/Linux with the 
2.6.18 SMP x86_64 kernel. 

3. 1 . Performance 

The results of the performance tests are presented in Fig. [3 where we show the wall-clock time as a 
function of the number of particles, N , for 1, 2 and 4 GPUs. In all the cases, the wall-clock time scales 
as expected, with iV^ and the wall-clock time decreases as the number of GPUs increases. Simulations 
with N < 64fc on 4 GPUs show a degradation in the speed-up, which is caused by the smaller number 
of particles in each block per GPU, see ij |2.41 and we notice similar behaviour for the GRAPE-6A cluster 



(jHarfst et al.l . 120071 ) . The performance of two CRAP E-6A- nodes is comparable to that of a single GPU. 



'http : //en . wikipedia. org/wiki/Natural_iiiiits#N-body .units 
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Figure 6: The computational speed (in GFlops) as a function of A'^ using GPUs (solid lines with bullets) and GRAPE-6A 
(dotted lines with stars) using the same initial conditions as using to generate Fig. [3] The number of GPUs and GRAPEs 
used for these measurements increases from bottom to top. 



In Fig. El we show the sustained performance in GFlop/4j for a range of TV. The performance of the 
GPU is better than that of the GRAPE-cluster by about a factor of two. The relatively slow PCI-bus 
used in the GRAPE nodes causes the performance for very small N to be even worse compared to the 
GPU-equipped PC. 

The average number of particles in a block time-step can also be used as a diagnostic tool. In phiGRAPE 
the block size is controlled by the standard time step criterion 
a term, that depends on the difference in force on a the particle at the beginning and end of a time 
step. This can have an effect on time steps (and therefore block size) if the forces are calculated with 
low precision. We found that small time steps tend to become much smaller in this case, as the relative 
error in the force difference becomes larger (when time steps are small, the force is usually large but it 
changes very little over this time step, resulting in the same problem as discussed before for calculating 
the distance of two particles). Smaller time steps generally result in smaller blocksizes, which in turn 
results in an inefficient use of the GPU and also increases the number of integration steps needed. 

In Fig. [7] we present the average block size as a function of N for GRAPE-6 as well as for GPU. 
The difference between the block size for GRAPE and GPU is small for the adopted standard time-step 
criterion, which indicates that the accuracy in the force calculation on the GPU is comparable to that 
of GRAPE. It is important to note, that the emulation of double precision is the key for achieving this 
accuracy. 



AarsethL 1985 ). This criterion includes 



3.2. Accuracy 

The accuracy of the force evaluation and integration of the equations of motion is crucial for a reliable 
A^-body simulation. We therefore test in this section the adopted method for achieving high accuracy on 
the GPU hardware and compare this with GRAPE-6. There are various sources of error in any A^-body 
integration: the most straightforward comes from the integration scheme itself and depends on the order 
of the integrator; another error is caused by the limited (single) precision of the GPU. The error in the 
integration is controlled via the accuracy parameter ij (jAarsethl . Il985l ). such that smaller rj results in a 



^The exact number of floating point operations required for one force calculation varies in the literature. Here we 
assumed that a single fo rce calculation requires 60 operations, which was also used for determining the theoretical peak 
performance of GRAPE l lFukushiee et al.l 120051) 
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Figure 7: The average number of particles in a block as a function of N. 



lower energy error. For a sufficiently small rj the integration errors are dominated by the limited precision 
of the hardware. 

To test accuracy of Sapporo we integrated an equal mass Plummer models with various N. For every 
N we studied the dependence of the relative energy error, dE/E = {Ei — Eo)/Eo, as a function of the 
accuracy parameter, 77, and the number of particles, N. For this purpose, we varied rj and N between 
10~^ to 0.3 and 16k to 256k respectively. Our results show that the error is strongly dependent on 77, 
whereas dependence on N is rather weak, in Fig. [H]we show the results for N = 32k and N — 128k. 

For rj > 3 ■ 10~^, the relative errors dE/E decreases with 77 and the difference between GRAPE and 
GPU is indistinguishable independent of N (see Fig. [5]). This behaviour is expected, since for relatively 
high T] the energy error is dominated by the integration errors rather than by the errors in the forces 
evaluation. For 77 ^ 3 • 10"'^, however, the integration error saturates at the point where the hardware 
limits the accuracy of the force evaluation. As expected for 7; ;S 3 • 10"'^ the error dE/E does not 
depend on 77, although the saturation level is weakly dependent on N . On the GPU, the relative error 
is dE/E ~ 10~® due to the limited precision, which is of order of the minimal relative error that can 
be achieved with single precision. The errors produced by the GRAPE-6 saturates at a somewhat lower 
level fo dE/E :^ 10~^^i w hich is expected based on the higher accuracy of the internal GRAPE hardware 



( Makino and Taiiil . llOOS n. Note, that typical values for rj are between 10 and 10 , and therefore well 



within the range where Sapporo is not limited by the single precision accuracy of the GPU. 

3.3. Star cluster evolution and core collapse 

We tested Sapporo in a production environment by calculating the time evolution of an equal-mass 
Plummer model with iV=32k particles through core collapse. The integration was performed using 
Starlab TV-body simulation environment, and we used only one GPU for this calculation. 

The results of this calculation are presented in Fig. ^ where we show time evolution of some cluster 
parameters. In particular, it is evident that the cluster reaches core collapse at t ~ 6440 A^-body time 
units. The wall-clock time it took to reach core collapse was about 24 days, whereas it took the simulation 
only 9 days to reach the point half-way to core collapse. The whole simulation, which we terminated at 
t ~ 9200 A''-body time units, was completed in about 41 days. We note, that a relatively inexpensive 
commercial GPU, which we used for this experiment, proved to be stable during the entire simulation. 

The sustained performance up to t ~ 3000A^-body time units was about 140Gflop/s, which reduced 
to about 132 Gflop/s by the time core collapse was reached, and to drop even further to about 88 Gflop/s 
in the post core collapse phase. The lower performance after core collapse is mainly caused by the 
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Figure 8: The relative error in total energy as a function of the time-step accuracy parameter r] for a Plummer model with 
32k (left panel) and 128k (right panel) particles ware integrated for 1/4 TV-body time units using phiGRAPE. The dotted line 
with open squares shows the energy error for the GPU with Sapporo, the dashed line are from GRAPE-6A. 
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Figure 9: The left panel shows time evolution of the core radius (black line) and 1, 2, 5, 10, 25, 50, 75, 90% lagrangian 
radii (green lines, from bottom to top). The right panel shows the time evolution of both the number of stars in the core 
(top) and core density (bottom). The initial condition was an equal-mass Plummer distribution with N = 32768 particles. 
The core collapse is reached at t ~ 6440. The calculation was terminated at t = 9200. The average sustained speed for the 
entire calculation was about 113 Gfiop/s. The sudden decrease of largrangian radii at t ~ 6800 is caused by the removal of 
the escaped stars. 
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presence of a dense core and hard binaries, which formed during core collapse and which require additional 
calculations performed on the host. The mean sustained performance for the entire simulation was about 
113Gflop/s. 



4. Summary 

The integration of a large self gravitating system for many relaxation times is a computationally de- 
manding procedure. Currently, the GRAPE6 special-purpose hardware is the state-of-the-art for carrying 
out such simulations. In this paper, we have presented a library named Sapporo, which emulates the 
GRAPE6 functionality on NVIDIA GPUs using the CUD A programming environment. Our library is 
publicly available and we emulate GRAPE6 interface as closely as possible (with the exception of some 
rarely used functions). The library is designed in such a way that it can be simply swapped with the 
original GRAPE6 library and, as a result, a wide range of current A^-body programs, that have been 
written using the GRAPE6 interface, can now be used on GPUs using Sapporo without any modification 
of the source code. 

We have carried out a number of basic A^-body experiments in order to test the efficiency and accuracy 
of the library and we compared our results with the GRAPE6. We found that our library is twice as 
fast as commonly used GRAPE6A/GRAPE6-BLX cards, and it is as accurate as GRAPE with standard 
integration parameters. In particular, the performance of a PC equipped with two NVIDIA GeForce 
9800GX2 cards is on par with that of a 8-node GRAPE6A cluster or a 32-chip GRAPE6 board. 

We have also carried out a production run to test the ability of the library to handle a real astrophysical 
problem. For such stress-test, we have chosen to integrate an equal- mass Plummer model beyond core 
collapse. As expected, the cluster reaches core collapse in about 15 initial half-mass relaxation times and 
afterwards enters the phase of gravothermal oscilla tions. Our results are consistent with those previously 
published in the literature (jHeggie and 2003[ ). The ability of Sapporo to handle such a demanding 



problem demonstrates that the library can be safely used for realistic astrophysical problems. 

The high performance of the Sapporo library is achieved by splitting the steps of the Hermite integra- 
tion scheme between CPU and GPU in the same way as it is done when using the GRAPE6. In particular, 
the prediction and calculation steps are carried out on the GPU, whereas the corrector is computed on 
the CPU. In this way, the force calculation is always a dominant part of the integ ration. If the predictio n 
for all particles is carried out on the host, as it is done in the kirin library (jBelleman et al. 



the integration would be dominated by the prediction step and communication, and thus degrading the 
performance. 

It is common nowadays, that a single production node is equipped with at least two GPUs. Therefore, 
we designed the library to make an efficient use of such a configuration, and the user can configure the 
library to use any combination of available GPUs. The application which makes use of the library is 
unaware that multiple GPUs are utilised, i.e. no modification to its source code is required. 

We also implemented a neighbour lists functionality in Sapporo. The maximum number of neighbours 
that a particle in a block can store is configured at compile time (default is 256), and it cannot be changed 
during runtime. Calculating and storing neighbour lists has an impact on performance of a few tens of 
percent, depending on the maximum number of neighbours allowed. On the other hand, it is also possible 
to disable neighbour hsts all together, and by this gain up to 50% in speed. 

An important part of Sapporo is the emulation of double precision using single precision numbers by 
using standard methods from the literatu re. Similar tech niques were implemented in the Yebisu A^-body 



code which also runs on NVIDIA GPUs ijNitadoril . 120091 ). The partial emulation of double precision has 



a minimal impact on performance, and places current and future NVIDIA GPUs on a very competitive 
ground with current and future GRAPE hardware. 

The emulation of double precision was required because the previous generation of NVIDIA GPUs 
(based on G80/G92 architecture) does not support it. Current NVIDIA GPUs (based on GT200 ar- 
chitecture) have native double precision support, however, only for a limited number of pipelines. This 
enables in principle the use of double precision, but the performance is an order of magnitude smaller in 
that cascj. The GT200 architecture also has almost three times as many parallel elements and twice the 
theoretical peak-performance of the previous G80/G92 architectures. Using a GT200 GPU, we were able 



''http: //www.nvidia. com/obj ect/product_tesla_sl070_us .html 
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to get a noticeable speedup of about 50%, which is smaller than the factor of two in peak performance. 
The reason is, that Sapporo was designed for the older GPU architectures, and therefore does not in the 
current version efficiently utilise all the parallel elements of a GT200 GPU. However, we are currently 
working on a new version of Sapporo which will be scalable across future generations of NVIDIA GPUs, 
including the GT200. 
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